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Abstract 

Background: To interpret differentially expressed genes or other discovered features, re- 
searchers conduct hypothesis tests to determine which biological categories such as those of the 
Gene Ontology (GO) are enriched in the sense of having differential representation among the 
discovered features. Multiple comparison procedures (MCPs) are commonly used to prevent 
excessive false positive rates. Traditional MCPs, e.g., the Bonferroni correction, go to the oppo- 
site extreme of strictly controlling a family-wise error rate, resulting in excessive false negative 
rates. Researchers generally prefer the more balanced approach of instead controlling the false 
discovery rate (FDR). Methods of FDR control assign q- values to biological categories, but q- 
values are too low to reliably estimate a probability that the biological category has equivalent 
representation among the preselected features. Thus, we study application of better estimators 
of that probability, which is technically known as the local false discovery rate (LFDR). 

Results: We identified three promising estimators of the LFDR for detecting differential 
representation: a semiparametric estimator (SPE), a normalized maximum likelihood estimator 
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(NMLE), and a maximum likelihood estimator (MLE). We found that the MLE performs at 
least as well as the SPE for on the order of 100 of GO categories even when the ideal number of 
components in its underlying mixture model is unknown. However, the MLE is unreliable when 
the number of GO categories is small compared to the number of PMM components. Thus, 
if the number of categories is on the order of 10, the SPE is a more reliable LFDR estimator. 
The NMLE depends not only on the data but also on a specified value of the prior probability 
of differential representation. It is therefore an appropriate LFDR estimator only when the 
number of GO categories is too small for application of the other methods. 

Conclusions: For enrichment detection, we recommend estimating the LFDR by the MLE 
given at least a medium number (~100) of GO categories, by the SPE given a small number 
of GO categories (~10), and by the NMLE given a very small number (~1) of GO categories. 

Keywords: empirical Bayes, gene enrichment, gene expression, Gene Ontology local false discovery 
rate, minimum description length, multiple comparison procedure, normalized maximum likelihood, 
simultaneous inference 
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Introduction 



The development of microarray techniques and high-throughput genomic, proteomic, and bioin- 
formatics scanning approaches (such as microarray gene expression profiling, mass spectrometry, 
and ChlP-on-chip) has enabled researchers simultaneously to study tens of thousands of biologi- 
cal features (e.g., genes, proteins, single-nucleotide polymorphisms [SNPs], etc.), and to identify 
a set of features for further investigation. However, there remains the challenge of interpreting 
these features biologically. For a given set of features, the determination of whether some bio- 
logical information terms are differentially represented (i.e., overrepresented or underrepresented) , 
compared to the reference feature set, is termed the feature enrichment problem. The biological 
information term may be, for instance, a Gene Ontology (GO) category []| or a pathway in the 
Kyoto Encyclopedia of Genes and Genomes (KEGG) [20]. 

This problem has been addressed using a number of high-throughput enrichment tools, including 



DAVID 



lfj, MAPPFinder [ ljj , Onto-Express 



21|, and GoMincr 



3l|. Huang et al. [18j reviewed 



68 distinct feature enrichment analysis tools. These authors further classified feature enrichment 
analysis tools into 3 categories: singular enrichment analysis (SEA), gene set enrichment analysis 
(GSEA) and modular enrichment analysis (MEA). Here, we investigate the SEA problem using 
gene expression as a concrete example. More precisely, we consider whether some specific biological 
categories are differentially represented among the preselected genes with respect to the reference 
genes. We call this problem the gene enrichment problem. 

Existing enrichment tools mainly address the gene enrichment problem using a p- value obtained 
from an exact or approximate statistical test (e.g., Fisher's exact test, or the hypergeometric test, 
binomial test, x 2 test, etc.). For each GO term or other biological category, the null hypothesis 
tested and its alternative hypothesis are these: 

Ho : the GO category is equivalently represented among the preselected genes 
Hi : the GO category is differentially represented among the preselected genes 

The general process begins as follows: 

• For each GO category, construct Table [1] based on the preselected genes (e.g., differentially 
expressed (DE) genes) and reference genes (e.g., all genes measured in a microarray experi- 
ment). 

• Compute the p-value for each GO category using a statistical test that can detect enrichment 
in the sense of differential representation among the preselected genes. 
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Table 1: The number of differentially expressed (DE) and equivalently expressed (EE) genes in a 
GO category. Here, Xi (i — 1,2) is the number of DE (i = 1) or EE genes (i = 2) in the GO 
category; n is the total number of DE genes; N is the total number of reference genes. 

DE genes EE gene Total 
In GO category x\ x^ X\ + x 2 

Not in GO category n — x\ N — n — xi N — x\ — x-i 
Total n N-n N 



Multiple comparison procedures (MCPs) are then applied to the resulting p- values to prevent 
excessive false-positive rates. The false discovery rate (FDR) [3J| is frequently used to control the 
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expected proportion of incorrectly rejected null hypotheses in gene enrichment studies 
because it has lower false-negative rates than the Bonferroni correction and other methods of 
controlling the family-wise error rate. Methods of FDR control assign q-values [28| to biological 
categories, but q-values are too low to reliably estimate the probability that the biological category 
has equivalent representation among the preselected features. Thus, we study application of better 
estimators of that probability, which is technically known as the local false discovery rate (LFDR) . 
Hong et al. [l^l used an LFDR estimator to solve a GSEA problem and pointed out that this 
was less biased than the q-value for estimating the LFDR, the posterior probability that the null 



hypothesis is true. 



Efron [12|,[13| devised reliable LFDR estimators for a range of applications in microarray gene 
expression analysis and other problems of large-scale inference. However, whereas microarray gene 
expression analysis takes into account tens of thousands of genes, the gene enrichment problem 
typically concerns a much smaller number of GO categories. While those methods are appropriate 
for microarray-scale inference, they are less reliable for enrichment-scale inference Bickel 4, {J. 
Thus, we will specifically adapt three types of LFDR estimators that are appropriate for smaller- 
scale inference to address the SEA problem. Here we will focus on genes and GO categories. 
Nevertheless, the estimators used can be broadly applied to other features (e.g., proteins, SNPs) 
and biological terms (e.g., those featuring metabolic pathways). 

The sections of this paper are arranged as follows. We will first introduce some preliminary 
concepts in the gene enrichment problem. Next, 3 LFDR estimators will be described. After that, 
we will compare the LFDR estimators using breast cancer data and simulation data. Finally, we 
will draw conclusions and make recommendations on the basis of our results. 
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Preliminary concepts 



The gene enrichment problem described in the Introduction is stated here more formally for appli- 
cation of LFDR methods of the next section. 

Likelihood functions 

Consider Table Q] Let X\ and X2 respectively denote the random numbers of DE and EE genes in 
a GO category. The resulting categories follow the binomial distribution, i.e.,Xi ~ Binomial(n, IT) 
and X2 ~ Binomial(7V — n,!^), where IT is the proportion of DE genes in the GO category and 
II2 is the proportion of EE genes in the GO category. Under the assumption that X\ and X2 are 
independent, the unconditional likelihood is 

L{Jl 1 ,Jl 2 ;x 1 ,x 2 ,n,N) (2) 
= Pi(X 1 = x u X 2 = ar 2 ;ni,n 2l n,iV) 

= ^"V^^nf (i-ni) n - xi nf (l-na)^" -352 

where < xi < n, < x 2 < N - n and < 11, < 1, % = 1, 2. 
If we define 

A = in[n 2 /(i - n 2 )] (3) 

and 

6 = lnpi/U - Hi)] - A (4) 

then 9 is the parameter of interest, representing the log odds ratio of the GO category, and A is a 
nuisance parameter. Under the new parametrization, the unconditional likelihood function ^ is 

l(6,\ ]XiM n) = 7/;; + i )ni(1+eA)ma (5) 

where < x\ < n and < x% < N — n. 

In equation ([S]), we take the interest parameter 6 and also the nuisance parameter A into con- 
sideration. Consider statistics T and S, functions of X\ and X2, such that T{X\, X2) = X\ and 
S(Xi,X2) = Xi + X<x. Thus, T represents the number of DE genes in a GO category, and S rep- 
resents the number of total genes in a GO category. Let t and s be the observed values of statistic 
T and S. The probability mass function of T{x\ 1 X2) — t evaluated at S(xi, X2) — x% + X2 = s, say 
Pr(T = i|iS = s;6,X,N,n), does not depend on the nuisance parameter A [4|. See also Example 



8.47 of Severini |27j |. Thus, we derive the conditional probability mass function 

fe(t\s) = Pr(T = t\S = s;6,n,N) = . , VtA "~* ; _ (6) 

2-ij=max(0,s+n-N) \j) \ s~j J 

understood as a function of t. 

By eliminating the nuisance parameter A, we can reduce the original data x\ and X2 by consid- 
ering the statistic T — t. However, the use of the conditional probability mass function requires 
some justification because of concerns about losing information during the conditioning process. 
Unfortunately, in the presence of the nuisance parameter, the statistic S{X\,X%j = X\ + X2 is 
not an ancillary statistic for the parameter of interest. In other words, the probability mass func- 
tion of the conditional variable S(Xi,X2) may contain some information about the parameter 9 
{27]]. However, following the explanation of Barndorff-Nielsen and Cox [2, §2.5], the expectation 
value of the statistic S(Xi,X2) equals the nuisance parameter. Hence, from the observation of 
S(Xi,X2) alone, the distribution of the statistic S(Xi,X2) contains little information about the 
parameter 9 Q|. The statistic S(Xi,X2) satisfies the other 3 conditions of an ancillary statistic 
defined by Barndorff-Nielsen and Cox [2j: parameters 9 and A are variation independent; the statis- 
tic (T(X±, X2), S(Xi, X2)) is the minimal sufficient statistic; and the distribution of the statistic 
T(Xi, X2), given S(Xi,X%) — s, is independent of parameter of interest, 9, given the nuisance 
parameter A. Therefore, the probability mass function of the statistic S(X\,X2) contains little 
information about the value of the parameter 9. 

Hypotheses and false discovery rates 

Considering GO category i, we denote the T, S, t, s and 9 used in equation ([5]) as Tj, Si, ti, Si and 
6{. From Table [TJ the hypothesis comparison ((1) of GO category i is equivalent to 

di = versus 9 t ^ (7) 

Let S = (Si, S2, • • • , S m ) and s = (si, S2, ■ • ■ , s m ). Let BF^ denote the Bayes factor of GO category 
i: 

= Pr(y t = t t |S = s,^/0) 
' p r (T j=tj |s = B ,^ = 0) {> 
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It is called the Bayes factor because it yields the posterior odds when multiplied by the prior odds. 
More precisely, the posterior odds of the alternative hypothesis corresponding to GO category i is 

Ui pr^ = 01*0 - BFi x (9) 

where 7To is the prior conditional probability that a GO category is equivalently represented among 
the preselected genes given s, i.e., ttq = Pr(6>, = 0|S = s). Thus, (1 — tto)/tto is the prior odds of 
the alternative hypothesis of differential representation. According to Bayes' theorem, the LFDR 
of GO category i is 



LFDR, = Pr(0i = 0\U) = — !— , (10) 

1 + U>i 



where u>i is defined in equation 



LFDR estimation methods 
Semiparametric LFDR estimator 

Let a denote any significance level chosen to be between and 1. For all GO categories of interest, 
the FDR may be estimated by 

/ mcx \ 

FDR(a) = min — _ ,1 (11) 

where m is the number of GO categories, pj is the p-value of GO category j, and l{ p .< Q } is the 
indicator such that \{p i < a } = 1 if Pj < ot is true and l{ Pj < Q ,} = otherwise. Thus, J~] -y l{p 3 < a } 
represents the number of GO categories with discovered differential representation, and ma esti- 
mates the number of such discoveries that are false. 

Let 7"i be the rank of the p-value of GO category i, e.g., r, = 1 if the p-value of GO category i 
is the smallest among all p- values of m GO categories. Based on a modification of equation (|TTj) , 
the semiparametric estimator (SPE) of LFDR of the GO category i is 



LFDR, = i y n 3 (12) 



I . n > f 

It is conservative in the sense that it tends to overestimate the LFDR 
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Type II maximum likelihood estimator 

Bickel [5] follows Good [15] in calling the maximization of likelihood over a hyperparameter Type 
II maximum likelihood to distinguish it from the usual Type I maximum likelihood, which pertains 
only to models that lack random parameters. Type II maximum likelihood has been applied to 
parametric mixture models for the analysis of microarray data 
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23|, proteomics data [9(, and 



genetic association data [30]. In this section, we adapt the approach to the gene enrichment problem 
by using the conditional probability mass function defined above. 

Let G(s) = {gg(*\s); 6 > 0} be a parametric family of probability mass functions with 

9o{»\b) = ^x[/ 9 (.|s)+/_ e (.|s)] (13) 

where /g(»|s) is defined in equation (|6]). We define the k-component parametric mixture model 
(k-component PMM) as 

fc-i 

0(«|s;0o,.. M fe _i,7ro,...,7r fe -i) = ^2n jgej (»\s) (14) 

i=0 

where do = and 6j ^ 9j, if j ^ J. 

Let T = (Ti, T2, • • • , T m ) and t = (ti,t^, ■ ■ ■ , t m ) be vectors of the T^s and Us used in equation 
([5]). Assuming Ti is independent of Tj and Sj for any i ^ j, the joint probability mass function is 



g(t\s;9 ,...,9 k ^i,ir Q ,...,Trk-i) = J\g(t t \s;9 Q , . . . ,9 k -i,^o, ■ ■ ■ ,^k-i) (15) 

i=l 
m 

= JJ_g(ti\si; 9 Q ,..., 6» fc _i, 7r , . . . , 7r fe _i) 

i=l 

where Si is the observed value of Si for GO category i, and s = (si, s%, ■ ■ ■ , s m ). 

Moreover, we assume that for given the number of genes in GO category i, Ti {i = 1, . . . , m), 
satisfies the fc-component PMM shown in equation (fT4")h In other words, we assume that the possible 
log odds ratios of GO category i are the 9q, 61,62, ■■■ ,9 k -i of equation (| 14|) if the alternative 
hypothesis Hi in the hypothesis comparison ((7]) is true. 

Therefore, the log-likelihood function under the fc-component PMM for all GO categories is 



logL(6> , • • • , 9k-i, To, . . . , Hk-i) = logg(t|s; 9 ,..., 6 k -i,ir , 7r fe _i) 

m k — 1 

= H ^g^iTjge^tilsi) (16) 

i=l j=0 
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The LFDR of GO category i is estimated by 

LFDR- fe) = Jo9Mfd_ _ 

g(ti\sf,9o, #i, • • • , 6k-i,TTo, • • • , 7r fc _i) 

where 9\, . . . , 9k-i and no, . . . , rck-i are maximum likelihood estimates of 9i, . . . , 0fc-i and 7To, . . . , 7Tfe_i 

nA 

in equation (|16p . We call LFDR^ the k-component maximum likelihood estimator (MLEfc). 
LFDR estimator based on the normalized maximum likelihood 

Combining equations (|9"|- (fTU)) . we obtain 

LFDR t = ( 1 + BFj x ^ -7r °^ (18) 



Therefore, given a guessed value of ttq, we may use an estimator of the Bayes factor to estimate the 
LFDR of a GO category. 

We next develop such an estimator of the Bayes factor. For GO category i, let Ei stand for 
the set of all probability mass functions defined on {0, 1, . . . , Si}, the set of all possible values of U. 
Based on the hypothesis comparison (J7J, the set of log odds ratios, denoted as O, is {0} under the 
null hypothesis and is the set of all real values except under the alternative hypothesis. With the 
assumption that the random variable Tj is independent of the random variable Sj for any i ^ j, 
the regret of a predictive mass function / e Si is a measure of how well it predicts the observed 
value tj e {0, 1, . . . , Si}. The regret is defined as 

re g (M| Sl ;0) = log%^ (19) 



where 9i(ti\si) is the Type I MLE with respect to the O under the observed values tj given Sj 

For all members of Ei, the optimal predictive conditional probability mass function of GO cate- 
gory i, denoted as /J, minimizes the maximal regret in the sample space {0, 1, . . . , Si} in the sense 
that it satisfies 

// = axgmin max reg(/, t|sjj O) (20) 
feSi te{o,i,...,s < } 

It is well known [l(| that the predictive probability mass function that satisfies equation ([2U|) is 



A f . | n ^ maxe ee /e(t t [g t ) / ,s 

A (**!-*, e) = 4(e) ( 21 ) 



9 



where fe(ti\si) is the conditional probability mass function defined in equation (JSJ), and /Cj(O) is 
the constant defined as 



/Cj(6) = max/ e (j/| Si ) (22) 

min(sj ,ni ) 

max/ e (2/| Sl ) 



9ee 

y— max(0,Sj- 712) 

min(s i: ni) ™ 2 )e y ^^ 

X/ y^min(s,ni) /nj\ I n 2 ) p j§i(y) 

y=max(0,Si— n 2 ) ^j^maxfO.Si — n 2 ) V j J Is;— j/ 



where 



9i(y) = arg max I Si) (23) 



We call //(tj|si; B) the normalized maximum likelihood (NML) associated with the hypothesis that 

0< e e. 

Thus, BFi is estimated by 



BF.; = H^ll^Lf^l ^ (24 ) 



-t_ //(tiM^o) 

0) 

which we call the the NML ratio. Therefore, by combining equations (|8J and ©, if we guess the 
prior probability ttq, the LFDR estimate of GO category i in the hypothesis comparison ([7]) is 



— t ' ' ' 

LFDR,- 



1 - 7T t 

1 + x BF, 

7T0 



(25) 



where BF^ is defined in equation (|24l) . We call this LFDR estimator the normalized maximum 
likelihood estimator (NMLE). 

To assess the performance of the NML ratio Bf\ , it will be compared to the following estimate 
of the Bayes factor. Equations (fT8|) and (flTj) suggest 

&, . i^£5Rf „ i^a (26) 



LFDR 

as an MLE-based estimator of BF, . 



« 7T 
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Results 



Breast cancer data analysis 

The data set used here is from an experiment applying an estrogen treatment to cells of a human 



breast cancer cell line [26j. The data, which is available from the Bioconductor project, contains 8 
Affymetrix HG-U95Av2 CEL files from an estrogen receptor-positive breast cancer cell line. (For 
further information concerning the data and also the Bioconductor project, see Gentleman et al. 
jl^.) For simplicity of terminology, we consider probes in the microarray experiment as genes, and 
use the 12,625 genes expressed in the microarray experiment as a reference. 

We selected as genes of interest those that were differentially expressed between two groups 
according to the following criterion. Using the LFDR as the probability that a gene is EE, we 
considered genes with LFDR estimates below 0.2 as DE. In other words, we selected as DE genes 
those were differentially expressed with estimated posterior probability of at least 80%. We used 
the 2-sample t-test with equal variances to compute the p- value of each gene in the microarray. The 
LFDR of every gene is estimated using the theoretical null hypothesis method of Efron flj . [ljj]; 
empirical null hypotheses can lead to excessive bias due to deviations from normality [8] . When we 
compared gene expression data for the presence and absence of estrogen after 10 hours of exposure, 
we obtained 74 DE genes. 

Defining unrelated pairs of GO categories as those that do not share any common ancestor, 
we selected for analysis all unrelated GO molecular function categories with at least 1 DE gene, 
thereby obtaining a total of 82 GO categories of interest. For each GO category, the p-value used 
in SPE to estimate LFDR is computed based on the 2-sided Fisher's exact test. Figure Q] compares 
the SPE to the MLEs based on the 2-component (MLE2) and 3-component (MLE3) PMM. Figure 
[5] displays the probability mass of GO:0005524 under the null and alternative hypotheses of the 
hypothesis comparison ([7]) . Figure [3] compares MLE-based estimates of the Bayes factor given by 
equation (f2"6")l to the NML ratios given by equation (f2"6")l . 
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Figure 1: Comparison of the LFDR estimated by the SPE with the LFDR estimated by the MLE2 
(left) and MLE3 (right). Each integer represents a number of GO categories. Intergers > 1 indicate 
ties. 
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Figure 2: The conditional probability mass functions given the number of genes in GO:0005524 
under a null hypothesis, and alternative hypotheses based on the 2-component PMM (left) and 
3-component PMM (right). The grey dashed line is the number of DE genes in GO:0005524. 



Simulation studies 

The aim of the following simulation studies is to compare the LFDR estimation bias of SPE, MLE2, 
and MLE3. The NMLE is not taken into account because its performance depends not only on the 
data, but also on the specified prior probability n . 

The simulation setting involves 10, 000 genes in a microarray with 200 genes identified as DE 
and 100 GO categories. We conducted a separate simulation study using each of these values of no- 
50%, 60%, 70%, 80%, 90%, and 94%. 

Since the PMM behind the MLE is optimal when the number of GO categories with overrepre- 
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Figure 3: Comparison of the Bayes factor estimated by the NML ratio with that estimated by the 
MLE2 (left) and MLE3 (right). The integers are defined in Figure [TJ The grey dashed lines mark 
commonly used thresholds for strong and overwhelming evidence \WL 0] • 

sentation ("enrichment") is equal to the number with underrepresentation ("depletion") , we assessed 
the sensitivity of the MLE to that symmetry assumption by using strongly asymmetric log odds 
ratios as well as those that are symmetric. For each GO category, two configurations were used in 
this simulation to choose log odds ratios: the asymmetric configuration shown in equation (|27[) and 
the symmetric configuration shown in equation 



^asymmetric 



loofej > 1<*< 100(1 -tto) 



(27) 



0, 



^symmetric 



100(1 -n ) <i< 100 
l<i< 50(1 -tto) 



10(l-7ro) ' 

5 -T0(T^T> 50(1 - 7T ) < z < 100(1 - 7T ) 

0, 100(1 - 7T ) < % < 100 



(28) 



Considering the log odds ratios of all 100 GO categories constructed by either the asymmetric 
or the symmetric configuration, we generated Table Q] for each GO category as follows: 

• x\ is generated from a binomial distribution with the parameter IT used in equation IT 
is a real value randomly picked from to 1 . 



xi is obtained from a binomial distribution with the parameter II2 
obtained by solving equation (QJ. 



(i-ni)x2 8 
rii 
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The p-value of each GO category used in the SPE is obtained from 2-sided Fisher's exact test. 
The fc-component PMM (k — 2 or k — 3) used in the MLE is shown in equation ([Til) with wj = 
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Figure 4: The performance of LFDR estimators for equivalently (dashed line) or differentially (solid 
line) represented GO categories. 

(1 — 7To) jk \j = 1, . . . , k] and gg.(»\s) — ge^t^Si) defined in equation (j 13[) . For every log odd ratio 
sequence, we estimated the LFDR 20 times using the SPE, MLE2, and MLE3. We compared 
the performances of the 3 estimators by means of estimating the LFDR bias. The true LFDR is 
computed by equation (1101) . where 



fo(U) 



Emu 
7= 



and fi(U) is computed by 



j=max(0,Si+n— TV) \j 



i J 

J 3=1 



(n\ (N-n\ 
n-N) \ j) \si-j) 



where fo(t\s) is defined in equation ([5]). 

Figure f?] shows the performance comparisons of the 3 LFDR estimators for simulation data 
obtained from the symmetric and asymmetric log odds ratios. The LFDR biases estimated by the 
SPE and MLE2 are similar. The LFDR estimated by the MLE3 provides the lowest bias among 
the 3 LFDR estimators. Moreover, the estimated LFDR biases of the estimators are not strongly 
affected by whether the log odds ratios are symmetric or asymmetric. Furthermore, the bias of the 
LFDR estimated by the SPE decreases as 7To, the probability that GO categories are equivalently 
represented, increases. However, the LFDR estimate attains a negative bias if no is higher than 
80%. In other words, some equivalently represented GO categories are declared as differentially 
represented GO categories. 
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Conclusions 



Efron's method [12, 13] can be used to estimate GO categories and thus address the gene enrichment 
problem, provided that thousands of GO categories are taken into account. However, in most gene 
enrichment studies, researchers focus on medium- or small-scale numbers of GO categories, i.e., 
several hundred, dozens or only one GO category. Here, we adapted 3 LFDR estimators (the SPE, 
MLE, and NMLE) to address the gene enrichment problem with medium- and small-scale numbers 
of GO categories, and compared these using breast cancer and simulation data. 

The MLE is sensitive to k, the number of PMM components. The MLE is used when considering 
a medium-scale number of GO categories, i.e., 100. In our breast cancer data analysis, the estimated 
LFDRs of GO:0051082 and GO:0005524 using MLE2 were 100% (Figure [TJ. However, the LFDRs 
estimated by MLE3 were very close to 0. Using the MLE formula shown in equation (fl7|) . and the 
fc-component PMM shown in equation (fT4)) . we determined that the sensitivity of the LFDRs of GO 
category i estimated by MLE2 and MLE3 depended mainly on the sensitivity of the Bayes factor, 
based on the number of PMM components. Comparing the probability masses of GO:0005524, 
based on the 2- and 3-component PMMs shown in Figure^ we found that the probability mass of 
GO: 0005524 under the null hypothesis is larger than that under the alternative hypothesis based 
on the 2-component PMM (left plot in Figure [5]). By contrast, the probability mass under the null 
hypothesis is smaller than that under the alternative hypothesis based on the 3-component PMM 
(right plot in Figure [3]). Thus, the LFDR estimated by the MLE is strongly dependent on the 
number of PMM components. 

Nevertheless, the performance comparison in Figure @] indicates that the MLE has lower bias 
than the SPE when the number of GO categories is much larger than k even when the ideal value of 
k is unknown. Moreover, MLE3 has lower bias than MLE2 as an LFDR estimator. However, when 
the number of GO categories is not much larger than k, the estimated proportion of GO categories 
equivalently represented become strongly biased toward 0. In that situation, the false positive rate 
increases as the number of PMM components. 

Due to its conservatism and freedom from the PMM, we recommend using the SPE when the 
number of GO categories of interest is too small for the MLE, e.g., about 10 categories. Based on the 
simulations reported by Bickel we conjecture that the SPE has acceptably low LFDR-estimation 
bias when there are at least 3 GO categories. 

Finally, we recommend that the NMLE be used given only 1 or 2 GO categories of interest. 
Neither the MLE nor the SPE is able to estimate the LFDR for only 1 GO category of interest; 
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moreover, they probably have excessive bias when based on only 2 GO categories. Thus, the NMLE 
is the recommended method of addressing the gene enrichment problem in this smallest-scale case. 
The NMLE depends not only on the data but also on a guess of the value of tt , which, in the 
absence of strong prior information, is often set to the default value of 50%. A closely related 
approach is to use the NML ratio as an estimate of the Bayes factor directly without guessing 
7To. By using 10 and 100 as thresholds of each estimated Bayes factor to determine whether a GO 
category is differentially represented, we reached similar conclusions whether using the NML and 
or an MLE (Figure [3]). Thus, at least for our data set, the NML ratio tends to estimate the Bayes 
factor almost as accurately as methods that simultaneously use information across GO terms. 
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